NSF PAR Search | NSF Public Access Repository

Zorro: Quantifying Uncertainty in Models & Predictions Arising from Dirty Data

https://doi.org/10.1145/3722212.3725143

Hu, Kaiyuan; Zhu, Jiongli; Glavic, Boris; Salimi, Babak (June 2025, ACM)

Free, publicly-accessible full text available June 22, 2026
Stress-Testing ML Pipelines with Adversarial Data Corruption

https://doi.org/10.14778/3749646.3749721

Zhu, Jiongli; Xu, Geyang; Lorenzi, Felipe; Glavic, Boris; Salimi, Babak (July 2025, Proceedings of the VLDB Endowment)

Structured data-quality issues—such as missing values correlated with demographics, culturally biased labels, or systemic selection biases—routinely degrade the reliability of machine-learning pipelines. Regulators now increasingly demand evidence that high-stakes systems can withstand these realistic, interdependent errors, yet current robustness evaluations typically use random or overly simplistic corruptions, leaving worst-case scenarios unexplored. We introduce Savage, a causally inspired framework that (i) formally models realistic data-quality issues through dependency graphs and flexible corruption templates, and (ii) systematically discovers corruption patterns that maximally degrade a target performance metric. Savage employs a bi-level optimization approach to efficiently identify vulnerable data subpopulations and fine-tune corruption severity, treating the full ML pipeline, including preprocessing and potentially non-differentiable models, as a black box. Extensive experiments across multiple datasets and ML tasks (data cleaning, fairness-aware learning, uncertainty quantification) demonstrate that even a small fraction (around 5%) of structured corruptions identified by Savage severely impacts model performance, far exceeding random or manually crafted errors, and invalidating core assumptions of existing techniques. Thus, Savage provides a practical tool for rigorous pipeline stress-testing, a benchmark for evaluating robustness methods, and actionable guidance for designing more resilient data workflows.
Free, publicly-accessible full text available July 1, 2026
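
The search described in the Savage abstract (choose a vulnerable subpopulation, then tune corruption severity, while treating the pipeline as a black box) can be illustrated with a toy sketch. Everything here is an assumption made for illustration and is not taken from the paper: the synthetic data, the nearest-centroid stand-in pipeline, the label-flip corruption template, and the exhaustive grid search, which replaces the paper's bi-level optimization. All function names are hypothetical.

```python
import numpy as np


def nearest_centroid_accuracy(x_tr, y_tr, x_te, y_te):
    """Black-box stand-in pipeline: train a 1-D nearest-centroid classifier, return test accuracy."""
    c0, c1 = x_tr[y_tr == 0].mean(), x_tr[y_tr == 1].mean()
    pred = (np.abs(x_te - c1) < np.abs(x_te - c0)).astype(int)
    return float((pred == y_te).mean())


def flip_labels(y, mask, severity, budget):
    """Hypothetical corruption template: flip labels of a share of the rows selected by `mask`.

    `severity` is the share of the subpopulation to corrupt; `budget` caps the
    number of corrupted rows as a fraction of the whole training set.
    """
    idx = np.flatnonzero(mask)
    k = min(int(severity * len(idx)), int(budget * len(y)))
    y_bad = y.copy()
    y_bad[idx[:k]] = 1 - y_bad[idx[:k]]
    return y_bad, k


def worst_case_search(x_tr, g_tr, y_tr, x_te, y_te, budget=0.05):
    """Outer loop over subpopulations, inner loop over severity (plain grid search)."""
    best = {
        "accuracy": nearest_centroid_accuracy(x_tr, y_tr, x_te, y_te),
        "group": None, "threshold": None, "severity": 0.0, "flipped": 0,
    }
    for group in np.unique(g_tr):
        for threshold in np.quantile(x_tr, [0.0, 0.5, 0.9]):
            mask = (g_tr == group) & (x_tr >= threshold)
            for severity in (0.25, 0.5, 1.0):
                y_bad, k = flip_labels(y_tr, mask, severity, budget)
                acc = nearest_centroid_accuracy(x_tr, y_bad, x_te, y_te)
                if acc < best["accuracy"]:
                    best = {"accuracy": acc, "group": int(group),
                            "threshold": float(threshold),
                            "severity": severity, "flipped": k}
    return best


# Synthetic data: one feature, one binary group attribute, labels given by the sign of x.
rng = np.random.default_rng(0)
x = rng.normal(size=400)
g = rng.integers(0, 2, size=400)
y = (x > 0).astype(int)
x_tr, g_tr, y_tr = x[:300], g[:300], y[:300]
x_te, y_te = x[300:], y[300:]

clean_accuracy = nearest_centroid_accuracy(x_tr, y_tr, x_te, y_te)
result = worst_case_search(x_tr, g_tr, y_tr, x_te, y_te, budget=0.05)
print("clean accuracy:", clean_accuracy)
print("worst corruption found:", result)
```

The search only calls the pipeline through its accuracy, so any model or preprocessing step could be substituted for the stand-in classifier. A grid is used here for readability; it does not scale the way an optimization-based search would.
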
Learning from Uncertain Data: From Possible Worlds to Possible Models

Zhu, Jiongli; Feng, Su; Glavic, Boris; Salimi, Babak (February 2025, NeurIPS 2024)

We introduce an efficient method for learning linear models from uncertain data, where uncertainty is represented as a set of possible variations in the data, leading to predictive multiplicity. Our approach leverages abstract interpretation and zonotopes, a type of convex polytope, to compactly represent these dataset variations, enabling the symbolic execution of gradient descent on all possible worlds simultaneously. We develop techniques to ensure that this process converges to a fixed point and derive closed-form solutions for this fixed point. Our method provides sound over-approximations of all possible optimal models and viable prediction ranges. We demonstrate the effectiveness of our approach through theoretical and empirical analysis, highlighting its potential to reason about model and prediction uncertainty due to data quality issues in training data.
Free, publicly-accessible full text available February 13, 2026
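
The idea in the abstract above of running gradient descent over all possible worlds at once can be sketched in a much simpler setting than the paper's. The sketch below is an illustration under stated assumptions, not the authors' method: it uses intervals instead of zonotopes, a one-dimensional linear model without intercept, exact features, and uncertainty only in the labels. The function names `interval_gd` and `prediction_range` are hypothetical. In this restricted setting each update is affine in the weight and the labels, so propagating the interval endpoints yields bounds that contain the gradient-descent result for every possible world.

```python
import numpy as np


def interval_gd(x, y_lo, y_hi, lr=0.05, steps=200):
    """Gradient descent on mean squared error for y ~ w * x, with labels in [y_lo, y_hi].

    The update w <- w - lr * (2/n) * sum(x_i * (w * x_i - y_i)) is rewritten as
    w <- a * w + sum(c_i * y_i), which is affine in w and in the labels.
    """
    n = len(x)
    a = 1.0 - lr * (2.0 / n) * np.sum(x * x)  # factor applied to w at each step
    if not 0.0 <= a < 1.0:
        raise ValueError("choose lr so that 0 <= a < 1 (monotone, contracting update)")
    c = lr * (2.0 / n) * x  # coefficient of each label in the update
    # Bounds on sum(c_i * y_i): a positive coefficient takes the matching endpoint.
    b_lo = np.sum(np.where(c >= 0, c * y_lo, c * y_hi))
    b_hi = np.sum(np.where(c >= 0, c * y_hi, c * y_lo))
    w_lo = w_hi = 0.0
    for _ in range(steps):
        # a >= 0, so scaling keeps the order of the two endpoints.
        w_lo, w_hi = a * w_lo + b_lo, a * w_hi + b_hi
    return w_lo, w_hi


def prediction_range(w_lo, w_hi, x_new):
    """Range of predictions w * x_new over all weights in [w_lo, w_hi]."""
    lo, hi = w_lo * x_new, w_hi * x_new
    return min(lo, hi), max(lo, hi)


# Three training points; the third label is only known to lie in [5, 7].
x = np.array([1.0, 2.0, 3.0])
y_lo = np.array([2.0, 4.0, 5.0])
y_hi = np.array([2.0, 4.0, 7.0])

w_lo, w_hi = interval_gd(x, y_lo, y_hi)
print("weight range:", (w_lo, w_hi))
print("prediction range at x = 4:", prediction_range(w_lo, w_hi, 4.0))
```

With uncertain features or more dimensions, plain intervals lose the dependencies between variables and the bounds can widen quickly, which is the kind of imprecision that richer abstract domains such as zonotopes are meant to reduce.
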
Generating Interpretable Data-Based Explanations for Fairness Debugging using Gopher

https://doi.org/10.1145/3514221.3520170

Zhu, Jiongli; Pradhan, Romila; Glavic, Boris; Salimi, Babak (June 2022, ACM SIGMOD)

Full Text Available
Interpretable Data-Based Explanations for Fairness Debugging

https://doi.org/10.1145/3514221.3517886

Pradhan, Romila; Zhu, Jiongli; Glavic, Boris; Salimi, Babak (June 2022, ACM SIGMOD)

Full Text Available